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MAPPING  AN  ADAPTABLE  EIGENSTRUCTURE  TECHNIQUE 
ONTO  THE  TOPOLOGIX  HYPERCUBE 

INTRODUCTION 

Linear  algebra  techniques,  such  as  matrix-matrix  multiplication  and  eigenstructure 
decomposition,  have  been  studied  on  sequential  computer  systems  for  many  years.  Some  of  the 
algorithms  used  in  these  applications,  most  of  which  have  been  widely  published,  include  BLAS 
I,  II,  and  HI;  the  Linear  Algebra  Package  (LAPACK);  and  the  revised  Linear  Algebra  Package 
(LINPACK).  In  recent  years,  it  has  been  determined  that  parallel  implementations  of  the  best 
sequential  algorithms  for  linear  algebra  provide  the  optimal  criteria  for  measuring  the  performance 
of  future  algorithms  and  for  evaluating  new  parallel  architectures.  Such  parallel  implementations 
also  provide  accuracy  and  stability  to  signal  processing  applications. 


BACKGROUND 

The  capability  of  adaptive  beamforming  (ABF)  to  maximize  signal-to-noise  ratio  (SNR)  is 
well-known.  One  of  the  many  techniques  is  the  minimum  variance  distortionless  response 
(MVDR)  algorithm.1  First  documented  in  the  early  1960s,  the  MVDR  technique  was  not 
immediately  accepted  because  of  its  computationally  intensive  nature.  This  problem  was  solved 
with  the  arrival  of  very  large  scale  integration  (VLSI)  technology,  which  increased  the  available 
processing  power. 

Previous  MVDR  implementations2  showed  that  the  algorithm  becomes  unstable  for  large 
SNR  values.  The  instability  occurs  when  correlation  exists  between  sensor  outputs,  making  the 
covariance  matrix  ill-conditioned.  An  ill-conditioned  matrix  (one  where  there  is  a  large  difference 
between  smallest  and  largest  eigenvalues)  is  noninvertible.  Eigenstructure  techniques,  however, 
allow  some  correlation  to  exist  without  affecting  the  stability  of  the  algorithm.3 
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RESOLUTION 

Eigenstructure-based  analysis  methods,  which  evolved  concurrently  with  MVDR  since  the 
early  1970s,  have  led  to  current  high  resolution  procedures.  Owsley4  developed  a  single  approach 
called  the  Enhanced  MVDR  (EMVDR)  beamformer  that  used  advantages  from  both  techniques. 
This  approach  provides  a  compromise  between  high  resolution  direction-finding  algorithms  (such 
as  the  multiple  emitter  signal  location  wavenumber  spectrum  estimator  (MUSIC))  and  ABF 
algorithms  (such  as  MVDR)  by  using  estimated  eigenstructures  rather  than  the  Cholesky 
factorizations  of  MVDR. 


PARALLEL  IMPLEMENTATION  RATIONALE 

As  mentioned  earlier,  signal  processing  applications  requiring  linear  algebra  were  limited 
by  the  throughput  provided  with  sequential  computer  systems.  The  use  of  VLSI  technology  in 
sequential  computers  was  still  not  enough  to  provide  the  required  speed.  However,  new  highly 
parallel  architectures  can  support  the  high  throughput  with  concurrent  processing,  distributed 
memory,  and  efficient  processor  networking. 

Parallel  processing  increases  throughput  by  breaking  a  problem  into  subparts  and  then 
executing  them  concurrently.  Multiple  processors  that  achieve  high  levels  of  performance  with 
high  levels  of  interconnectivity  have  also  been  developed  in  the  past  few  years.  Many  of  these 
systems  use  a  hypercube  pattern.^  such  as  the  one  for  this  project  (manufactured  by  Topologix, 
Inc.),  which  connects  16  INMOS  T800  Transputers  in  the  hypercube  configuration. 

This  report  describes  the  implementation  of  a  QR  algorithm  on  the  Topologix  hypercube 
architecture.  The  QR  algorithm,  which  provides  a  very  fast  convergence  rate,  may  be  used  to 
calculate  the  eigenstructure  for  the  EMVDR  beamformer.  The  first  section  of  the  report  provides 
an  overview  of  the  EMVDR  algorithm.  Following  sections  describe  the  QR  factorization,  an 
adaptable  eigenvector/eigenvalue  decomposition  using  the  QR  factorization,  and  the  Topologix 
hypercube  architecture  and  special  software  routines.  In  the  final  section,  observations  about  the 
implementation  of  the  algorithm  are  discussed.  The  source  code  for  the  EMVDR  beamformer  with 
the  adaptable  QR  algorithm  is  included  as  appendix  A.  Appendix  B  contains  the  source  code  for 
broadcast  and  exchange  routines. 
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ENHANCED  MINIMUM  VARIANCE  ALGORITHM 

The  spatial  cross-spectral  density  matrix  (CSDM)  at  a  frequency  to  for  an  N  sensor  array 
with  K  sources  is  written  as 


C  =  ojMAMH  +  a£lN, 

where  M  is  an  N  x  K  matrix  of  orthonormal  eigenvectors  of  the  source  subspace,  A  is  a  diagonal 
matrix  of  the  eigenvalues  associated  with  the  eigenvectors,  and  In  is  an  N  x  N  identity  matrix.  The 
values  os  and  o0  are  the  spectral  densities,  respectively,  of  the  source  and  noise.  The  enhanced 
CSDM  is  defined  with  the  scalar  enhancement  factor  (e)4  as 

C(e)=eo?MAMH  +  c£lN.  (1) 

The  factor  e  (>  1)  makes  the  source  subspace  louder  than  the  noise  subspace,  thus  allowing  the 
sources  to  dominate  in  the  CSDM.  Equation  (1)  can  be  rewritten  as  6 


S 

C(e,s)  =  e  ^  minim*1  +  IN 


i=l 


where 

Pi  =  ^eigenvalue, 
jBi  =  corresponding  eigenvector,  and 
s  =  dominant  number  of  sources. 


(2) 


Inverting  equation  (2)  yields 


C(e,s)  1  =  iN  -  \  A  C^‘  niiinf  •  (3) 

1  +  epj 
i=l 
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The  weight  vector  for  steering  direction  d  is  defined  by 


w(e,s>- 

dH  C(e.s)1  d 


(4) 


Combining  equations  (3)  and  (4),  we  get  the  null  steerer  equation 


<*N 


s 

I 


e^i  H 

; - m^nii  )  sL 

1  +  cHj 


w(e,s)  = 


i=l 


N- 


5 

I 


i=l 


1  +  eHj 


(5) 


The  solution  to  equation  (5)  is  used  in  the  EMVDR  beamformer  equation 

X  =  w(e,s)H  x  , 

where  £  is  the  frequency  domain  input  data  vector.  The  EMVDR  algorithm  exploits  the  fact  that 
threshold  signals  (eigenvalues  near  noise  variance)  do  not  need  to  be  captured  by  the  estimated 
CSDM  because  they  are  filtered  out  by  the  sidelobes.  The  EMVDR  approach  requires  only  high 
level  signals  (interferences)  to  obtain  nearly  optimum  detection  performance.4*7 


4 


TR  8807 


QR  FACTORIZATION 

The  QR  factorization  is  the  most  popular  technique  for  evaluating  the  complete  set  of 
eigenvalues  and  eigenvectors  of  a  Hermitian  or  symmetric  matrix.8*9*10  It  is  used  by  most 
numerical  analysis  subroutines  because  it  offers  extremely  fast  convergence  rates.  The  QR 
factorization  is  a  matrix  realization  of  the  Gram-Schmidt  orthonormalization  process.8  The 
theorem  is  stated  as  follows: 

If  A  is  a  complex  nx  n  matrix,  then  there  is  a  unitary  matrix  Q  (QH  -  Q'1 )  and  an  upper 
triangular  matrix  R,  such  that  A  =  QR.  If  A  is  nonsingular  (its  inverse  exists),  then  Q  and  R  can  be 
uniquely  determined. 

Householder  transformations  are  accumulated  to  form  the  Q  matrix3-9 


Q  =  H!H2H3...Hk  . 


(6) 


Each  Householder  matrix  Hk  is  calculated  according  to  the  form 


where 


Pk 


(7) 


pk  =  first  element  in 
a 

^  ^  <r  Hall 

=  column  k  of  identity  matrix, 

a  =  column  of  original  matrix  (A), 

11^11  =  Euclidean  norm  of  column  a,  and 
o  =  ±1,  depending  on  sign  of  first  component  of  a- 

Each  Hk  is  chosen  such  that  it  will  introduce  zeros  into  the  last  n-k  components  of  the  k*  column 
of  A.  The  original  matrix  is  transformed  into  an  upper  triangular  matrix  R  by  premultiplying  A  by 
each  Hjc.  For  example,  if  A  originally  has  the  form 


5 


TR  8807 


"x  xxx" 

0  x  X  X 

A  =  0  X  X  X 

-  0  X  X  X  _ 

then  the  transform  Hj  is  determined  to  clear  the  last  n-1  components  in  the  first  column  of  Hi  A. 
When  the  transform  is  applied  to  A,  zeros  are  introduced  into  the  boxed  elements  as  follows: 


HjA 


ril  rl2  ri3  f14 

0  x  x  x 

0  0  x  x 

0  0  x  x 


where  rjj  is  an  element  of  the  final  R  matrix  (upper  triangular).  The  matrix  H2  is  chosen  to 

introduce  zeros  into  the  n-2  boxed  elements  in  the  second  column.  The  first  component  of  U2  that 
determines  H2  is  zero;  therefore,  the  components  labeled  rjj  and  0  are  undisturbed  by  the 
application  of  H2  to  Hi  A.  The  reduction  is  complete  when  the  matrix  A  is  upper  triangular.  The 
result  is 


H4H3H2H1A  = 


rn 

ri2 

ri3 

rU  ‘ 

0 

r22 

r23 

f24 

0 

0 

r33 

r34 

0 

0 

0 

— 1 

The  basic  QR  iteration  follows:10  Given  an  n  x  n  matrix  Ao  whose  eigenstructure  is 
desired,  a  unitary  matrix  Qk  and  an  upper  triangular  matrix  Rk  are  found,  such  that 


Ak  —  QkRk  • 


(8) 


The  next  matrix  in  the  iterative  sequence  is  formed  from  the  product 


Ak+l  =  RkQk  • 


(9) 


The  algorithm  is  described  by  the  compact  equation 


RkQk  =  Qk+lRk+l  • 
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The  sequence  of  matrices  Ak  will  converge  to  a  diagonal  or  upper  triangular  matrix  with  the 
eigenvalues  of  the  matrix  Ao  as  diagonal  entries  arranged  in  descending  order. 

The  QR  factorization  is  very  efficient,  but  it  is  not  directly  applicable  to  an  adaptively 
updated  covariance  matrix  because  each  iterative  step  does  not  involve  the  original  matrix.  In  an 
adaptive  environment,  the  matrix  Ao  is  estimated  by  averaging  over  a  sequence  of  observed 
samples  and  therefore  varies  with  time.  After  each  time  update,  the  QR  factorization  must  be 
performed  on  a  fully  dense  matrix.  It  is  therefore  desirable  to  adapt  the  estimates  of  the 
eigenstructure  (upper  triangular  matrix)  as  each  new  data  sample  is  available. 
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ADAPTIVE  QR  ALGORITHM 

We  can  examine  the  role  that  the  initial  matrix  Ao  plays  in  the  QR  iteration  to  develop  an 
adaptive  version  of  the  QR  algorithm.10  From  equation  (8),  we  know  that 

Q”  Ak  =  Rk  (10) 

Using  equation  (10),  we  express  the  QR  sequence  of  equation  (9)  as 

Ahi  =  RkQk=Qk  \Qk  (ii) 

Repeated  application  of  equation  (11)  shows  the  relationship  of  Ak+1  to  the  original  matrix 
A0as 

■^k+i  =  A0  Tk  ,  (12) 

where  T*  is  the  accumulation  of  Q  factors,  T*  =  QoQi  -Qk-  In  the  limit  k  -» infinity,  the  columns 
of  T  converge  to  the  eigenvectors  of  the  original  matrix  (Ao).  When  k  =  K,  where  K  is  an  arbitrary 
number  greater  than  zero,  the  eigenvalues  (Ak)  and  the  eigenvectors  (Tr)  of  the  original  matrix  Ao 
can  be  used  in  the  EMVDR  null  steerer  shown  in  equation  (5). 

The  matrix  Ao  varies  with  time  as  new  data  samples  become  available  in  an  adaptive 
application  such  as  EMVDR.  It  can  be  updated  by  means  of  an  additive  rank-one  modification: 

Ao,t+i  =  (1  -  a)A0,t  +  a^x”  !  •  (13) 

where 

t  =  update  time, 

x^j  =  new  data  vector  from  sensors,  and 
0  <  a  <  1  =  constant  exponential  weighting  factor. 
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When  the  QR  iteration  has  reached  step  k  =  K,  the  current  QR  product  matrix  A^t  can  be  replaced 
by  the  updated  matrix  Ao,t+i  by  use  of  equation  (12): 

Aiu+i  =  T£.u  A0,t+1  TK-l.t  . 

The  second  subscript  on  T  indicates  that  it  now  depends  on  the  update  time;  i.e.,  Tjc.t  is  the 
transformation  from  Ao,t  to  A^. 

Use  of  equation  (13)  allows  the  new  data  update  of  equation  (14)  to  be  written  in  the  form 

2t+l  =  ^k-l.t^t+l 

Ak.t+1  =  (1  '  <X)Aic,t  +  Ct2l+jZt+j  , 


which  is  shown  graphically  in  figure  1.  Premultiplying  the  new  data  vector  by  the 
accumulation  matrix  tJ?  j  t  ensures  that  the  next  iteration  will  converge  to  an  upper  triangular  result 

that  is  similar  to  the  original  matrix.  The  complete  adaptive  QR  algorithm  uses  two  simultaneous 
recursions.  The  first  set  of  recursions  performs  the  QR  iteration: 

Ak,t  Qk,tRk,l 

Ak+l,t  =  Rk,tQk,t  (  *  6) 

Tk,t  =  Tk.i.tQka . 

The  second  recursion  is  given  by  equation  (15)  and  is  invoked  whenever  a  new  data  vector 
becomes  available.  The  entire  process  is  illustrated  in  figure  2. 


l-a  A 
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Figured  Adaptable QR Algorithm 
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TOPOLOGIX  MULTICOMPUTER  SYSTEM 
HYPERCUBE  ARCHITECTURE 

The  hypercube  architecture  was  chosen  because  it  has  unique  qualities  that  are  appropriate 
to  both  the  QR  and  EMVDR  algorithms: 


1 .  Multiple  instructions  and  multiple  data 

The  architecture  enables  all  processors  to  execute  similar  application  programs  on  different 
data  sets  without  instruction  synchronization.  The  program  and  data  are  stored  in  local 
memory,  so  that  computation  is  within  the  individual  processor.  This  particular  advantage 
allows  each  processor  to  perform  the  reduced  dimensioned  EMVDR  (using  largest 
eigenvalues/eigenvectors)  algorithm  for  a  different  beam  direction,  which  produces  n 
weight  vectors  (for  an  architecture  with  n  processors)  simultaneously. 

2.  Multiple  topologies 

The  architecture  can  emulate  other  network  topologies,  such  as  linear,  mesh,  torus,  or 
tree,  which  allows  programs  to  change  the  topology  "on-the-fly"  so  that  functions  are 
performed  more  efficiently.  For  example,  the  architecture  could  be  configured  as  a  mesh 
(through  software)  to  perform  the  QR  algorithm,  but  could  still  use  the  hypercube  network 
to  broadcast  global  information. 

3.  Global  data  passing  (with  minimal  communications) 

Another  advantage  of  the  architecture  is  its  small  diameter.  Any  processor  can  send  data  to 
any  other  in,  at  most,  d  steps  (for  a  d-dimensional  hypercube).  Consider  a  13-dimensional 
hypercube  containing  8192  processors.  The  maximum  number  of  steps  for  a  message  to 
travel  between  any  two  processors  is  13.  Compared  with  either  a  linear  or  mesh 
configuration,  communication  is  significantly  reduced  for  broadcasting  data. 


The  Topologix  is  a  multicomputer  system  that  can  be  configured  into  a  hypercube 
interconnection  network.  The  address  of  a  processor  is  a  binary  number  of  length  d.  Each 
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processor  has  a  link  to  its  nearest  neighbors.  A  nearest  neighbor  is  a  processor  whose  address 
differs  by  exactly  1  bit  (Gray  code).  The  topology  guarantees  that  no  two  processors  are  more 
than  d  links  apart.  Table  1  identifies  features  of  the  INMOS  T800  transputer  chip  used  in  the 
Topologix  system.  Figure  3  shows  the  Topologix  four-dimensional  hypercube  configuration  and 
corresponding  binary  addresses. 


Table  1.  INMOS  T800  Transputer  Features 


T800  Features 

26  MHz  clock  rate 
32  Bit  word 
4  Communication  links 
20  Mbits/sec  I/O  per  link 
4  Mbytes  RAM  per  processor 
14  RISC  MIPs  (peak)  instruction  rate,  single  precision 
2.2  MFLOPs  (peak)  instruction  rate,  single  precision 


PROGRAMMING 

Converting  a  problem  from  a  sequential  system  (single  computer)  to  a  multicomputer 
system  is  not  a  straightforward  task.  On  a  sequential  machine,  there  is  no  concern  over  message 
passing,  synchronization,  or  load  balancing,  whereas  on  a  multicomputer,  these  factors  are 
important. 1 1  Mapping  an  algorithm  onto  a  multicomputer  includes  making  decisions  dependent  on 
communication  and  processing  operations.12  For  example, 


•  How  should  data  be  distributed  across  local  memories? 

•  Should  each  processor  repeat  identical  computations  or  share  results? 

•  How  many  processors  should  be  used  for  the  problem? 


Algorithms  requiring  frequent  data  exchanges  may  not  perform  well  on  some  hypercube 
multicomputers  because  of  low  communication  bandwidths.  Replicating  data  and  operations 
across  processors  or  decomposing  the  problems  into  larger  parts  can  reduce  communications.13 
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Figure  3.  Four-Dimensional  Hypercube 


HOUSEHOLDER  REDUCTION 

Communication  in  the  sequential  QR  algorithm  is  quite  intensive  because  of  the 
Householder  reduction.9  Columns  of  the  original  matrix  were  mapped  onto  processors  in  a 
wrapped  fashion  to  reduce  communications.  Wrapping  resulted  in  column  j  residing  in  processor 
O'  mod p),  where  p  is  the  number  of  processors  used  for  the  problem  (16  in  this  case). 

Communication  is  needed  to  distribute  the  vector  u*  to  all  processors.  Sending  &  to  all 
processors  enables  the  calculation  HkC  to  be  performed  in  a  parallel  fashion.  In  the  sequential 
version  of  the  reduction  operation,  equation  (7)  must  be  multiplied  by  the  (n-k)  columns  to  the 
right  of  column  k.  The  parallel  version  performs  most  of  the  reduction  simultaneously.  Every 
processor  takes  its  turn  to  generate  the  vector  and  to  broadcast  it  to  every  other  processor.  Each 
processor  holds  n/p  columns  of  the  upper  triangular  matrix  R  when  the  reduction  is  complete.  An 
exchange  routine  (discussed  later)  assembles  the  completed  R  matrix  in  each  processor. 
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QR  FACTORIZATION 

The  number  of  multiplications  necessary  to  solve  equation  (6)  is  on  the  order  of  n4 
operations.  It  is  desirable  to  perform  most  of  the  matrix  multiplications  simultaneously.  To 
perform  the  accumulation  process  of  equation  (6),  the  Topologix  was  configured  in  a  mesh 
topology  through  software  controls.  With  a  mesh  topology,  multiple  pipelines  can  perform  the 
process  in  parallel.  Figure  4  illustrates  the  mesh  topology  used  for  the  accumulation.  Each 
processor  multiplies  HkHk+j  (except  right  boundary  processors).  Once  the  accumulation  is 
complete,  processor  0  holds  the  entire  Q  matrix,  which  is  then  sent  to  all  processors.  Since  each 
processor  holds  the  entire  Q  and  R  matrices,  the  communication  costs  of  using  these  results  in  later 
steps  of  the  algorithm  are  reduced. 


Figure  4.  Communication  Flow  for  Q  Matrix  Accumulation 


MATRIX-MATRIX  MULTIPLICATION 

Matrix  multiplication  is  one  of  the  simplest  parallelization  problems  and  has  been  studied 
extensively  for  execution  on  parallel  architectures  using  various  decomposition  methods. 
Decomposition  involves  dividing  a  matrix  into  subunits  and  assigning  each  subunit  to  a  processor. 
A  subunit  can  be  a  row,  column,  element,  or  smaller  matrix. 
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A  square  sub-block  decomposition14  was  performed  because  the  Topologix  was  already 
configured  in  a  mesh  topology  (to  perform  Q  accumulation).  The  sub-block  decomposition  did  not 
require  communication  because  each  processor  already  held  the  R  and  Q  matrices.  Each  processor 
performed  the  following  matrix  multiply: 


Amk  =  X  Rmn  Q 


nk 


where  A"*,  R™1,  and  are  all  square  submatrices.  Each  processor  performs  n3/16  operations 
to  obtain  the  submatrix  result.  The  submatrices  are  then  assembled  and  the  resulting  matrix  is 
broadcast  to  all  processors. 


COMMUNICATION  ROUTINES 

To  perform  the  QR  algorithm,  it  was  necessary  to  devise  efficient  ways  to  pass  and 
exchange  data  among  processors  on  the  Topologix  system.  Two  routines  were  developed  to  aid  in 
data  movement 

Broadcast 

In  this  scenario,  where  one  processor  must  efficiently  distribute  results  to  all  processors, 
the  hypercube  configuration  solves  the  problem.  In  a  hypercube,  each  node  has  d  neighbors  (d- 
dimensional)  with  addresses  that  differ  by  exactly  1  bit.  The  bit  positions  correspond  to  port 
numbers  (e.g.,  0000  to  0001  over  port  0  and  0000  to  0100  over  port  2).  The  broadcast  routine 
uses  these  bit  positions  to  send  data  out  all  ports: 

For  b  =  0 tod-1 

1 .  Processors  are  divided  into  cubes  according  to  the  value  of  bit  position  b. 

2.  Processors  in  corresponding  positions  send! receive  data. 

The  broadcast  routine  begins  with  one  processor  (2°)  having  the  correct  data.  After  each  iteration, 
b,  the  correct  data  is  held  in  2b  processors.  After  four  iterations,  16  processors  hold  the  correct 
data.  Table  1  shows  the  measured  execution  times  for  various  matrix  sizes.  The  source  code  for 
the  broadcast  routine  is  included  in  appendix  B. 
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Exchange 

An  operation  to  assemble  scattered  data  and  then  distribute  the  result  to  every  processor  will 
reduce  overall  communications.  The  exchange  routine  performs  both  steps  simultaneously  using  a 
concept  similar  to  the  broadcast  routine: 

For  b-0 to d-1 

1.  Processors  are  divided  into  two  (d-1)  cubes  according  to  value  of  bit  b. 

2.  Processors  in  corresponding  positions  exchange  data. 

3.  Each  processor  concatenates  its  own  data  with  the  received  data  to  form  new  data. 

The  exchange  routine  assumes  that  the  data  held  in  each  processor  are  one  or  more  columns  of  the 
resultant  matrix.  The  concatenation  is  performed  by  adding  the  column  vectors  to  a  matrix 
(originally  full  of  zeros).  After  four  iterations,  every  processor  holds  the  completed  matrix.  Table 
2  shows  the  measured  execution  times  for  various  matrix  sizes.  The  source  code  for  the  exchange 
routine  is  included  in  appendix  B. 

Table  2.  Measured  Execution  Times  for  Broadcast  and  Exchange  Routines 


Matrix  size 

Bytes 

Broadcast  (sec) 

Exchange  (sec) 

4x4 

128 

.0032 

.0102 

8x8 

512 

.0034 

.0208 

16  x  16 

2048 

.0072 

.0498 

32x32 

8192 

.0144 

.1654 
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CONCLUSIONS  AND  RECOMMENDATIONS 

The  parallel  adaptive  QR  algorithm  was  run  on  the  Topologix,  and  the  execution  times  are 
summarized  in  table  3.  The  table  shows  only  two  matrix  sizes  because  the  Topologix  has  a 
communication  capacity  of  8-kbyte  blocks;  therefore  the  QR  algorithm  (as  implemented)  could  not 
handle  complex  matrices  larger  than  32  x  32.  Even  with  the  small  size  used  for  the  test  problem, 
the  results  showed  improvement  over  the  best  sequential  method.  The  sequential  method  was 
executed  on  a  single  Topologix  processor. 

Table  3.  Measured  Execution  Times  for  QR  Algorithms 


Matrix  size 

Processors 

16  x  16 

16 

1.24 

1.21 

32x32 

16 

9.12 

15.11 

Table  4  shows  the  order  of  operations  for  the  parallel  and  sequential  QR  algorithms 
(including  two  matrix  multiplies).  Estimates  of  larger  matrix  decompositions  were  not  made 
because  the  cause  of  the  overhead  time  (measured  -  ideal)  was  unknown.  The  table  shows  that  the 
computation  grows  faster  than  the  communication  for  increasing  n.  However,  because  there  are 
many  other  sources  of  overhead  that  also  appear  when  larger  problems  are  distributed  over 
concurrently  operating  processors,13  estimating  the  execution  time  of  these  problems  becomes 
difficult.  One  major  source  of  overhead  is  processor  communication.  Communication  between 
Topologix  processors  is  very  inefficient,  as  shown  in  figure  5,  which  depicts  measured  data  rates 
versus  number  of  bytes.  The  data  channels  are  rated  at  20  million  bits  per  second  (Mbps).  The 
channel  is  only  3 1  percent  utilized  with  the  maximum  amount  of  data  (8  kbytes).  The  poor  channel 
utilization  is  related  to  the  Topologix  operating  system  and  is  not  inherent  to  hypercube 
architectures. 

The  speedup  of  an  ideal  algorithm  implementation  would  be  p  (number  of  processors 
used),  with  each  processor  performing  1/p  operations  of  the  total  problem  (load  balanced).  Figure 
6  is  a  graphical  representation  of  table  4.  It  shows  the  ideal  speedup  that  could  be  achieved  with 
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the  QR  algorithm.  The  speedup  curve  in  the  figure  is  ideal  because  it  does  not  include  any 
overhead  (i.e.,  interprocessor  communication,  memory  accesses,  or  synchronization). 

Table  4.  Order  of  Operations 


HflUESm 

Calculations 

Communications 

Sequential 

2n4 

— 

Parallel 

5n 

The  QR  implementation  described  is  not  load  balanced  because  of  the  Householder 
reduction;  there  are  not  enough  matrix  elements,  after  the  reduction,  to  spread  over  the  processors. 
Fox14  proposes  a  scattered  decomposition  for  performing  the  Householder  reduction  to  preserve 
load  balance.  The  scatter  method,  however,  adds  more  communication,  0(n  ),  which  can  be 
devastating  for  a  high  overhead  system  like  the  Topologix. 


Figure  5.  Measured  Data  Transfer  Rates 
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Number  of  Matrix  Elements 
Figure  6.  Order  Of  Operations  and  Ideal  Speedup 

The  QR  factorization  uses  eigenstructure  techniques  that  need  at  least  an  order  of  magnitude 
more  computations  than  the  calculation  of  the  adaptive  weights.  Therefore,  concentrating  on  the 
eigenstructure  techniques  is  one  approach  to  a  parallel  implementation  of  the  EMVDR  application. 
The  QR  decomposition  (preceded  by  Householder  reduction)  described  above  does  not  provide 
load  balancing.  In  reference  15,  the  QR  algorithm  is  replaced  by  an  iterative  sequence  that 
converges  to  the  source  subspace  eigenstructure.  This  method  requires  many  iterations  and 
appears  to  be  difficult  to  program  onto  a  hypercube  architecture  efficiently;  however,  research  into 
other  eigenstructure  methods  for  hypercube  architectures  is  proceeding16  and  should  be  studied  for 
use  in  the  EMVDR  application. 

The  hypercube  architecture  can  solve  linear  algebra  algorithms  efficiently  (such  as  those 
used  in  signal  processing  applications)  as  well  as  provide  modular  growth  for  future  computing 
needs.  The  continuing  study  of  these  architectures  should  focus  on  such  systems  as  the 
Connection  Machine  and  the  NCUBE  6400  Series.  These  systems  use  faster  processing  engines, 
faster  communication  paths,  and  different  operating  systems  than  the  Topologix  system. 

Future  research  should  also  concentrate  on  the  relationship  between  parallel  architectures 
and  eigenstructure  methods.  Eigenstructure  algorithms  should  be  tailored  to  architectures  to  exploit 
hardware  assets  (e.g.,  processing  power,  distributed  memory,  or  networking  ability).  Ongoing 
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work  in  linear  algebra  algorithms  (including  eigenstructure  techniques)  on  various  architectures  in 
the  academic  and  industrial  communities  should  continue  to  be  supported  by  the  Navy. 
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APPENDIX  A 

SOURCE  CODE  FOR  EMVDR  APPLICATION 
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/****************************************************************** 
EMVDR  IMPLEMENTATION  ON  THE  TOPOLOGIX  HYPERCUBE 

AUTHOR:  Manfred  Leonhardt 

THIS  PROGRAM  PERFORMS  THREE  MAIN  FUNCTIONS: 


1)  GENERATES  RANDOM  GAUSSIAN  DATA  TO  BE  USED  AS  SENSOR 

OUTPUT, 

2)  PARALLEL:  QR  DECOMPOSITION,  RQ  MULTIPLY,  AND  TQ 

MULTIPLY  TO  GET  EIGENVALUES  AND  EIGENVECTORS,  AND 

3)  ENHANCED  MINIMUM  VARIANCE  DISTORTIONLESS  RESPONSE 

ADAPTIVE  BEAMFORMING  AS  SPECIFIED  IN  NUSC  TR  8305 
NOVEMBER  1988. 


THIS  PROGRAM  EXECUTES  ON  THE  TOPOLOGIX  SYSTEM  WITH  16 
PROCESSING  NODES  (4-DIM  HYPERCUBE).  IT  IS  LOADED  WITH 
THE  COMMAND: 


loadgo  al5-0  -w  2500000  EMV 

********  **************************************************** ******/ 


#include  <logixos/events.h> 
#include  <logixos/net.h> 
#include  <stdio.h> 

#include  <math.h> 

#include  <sys/time.h> 
#include  "complex.h" 
#include  "matrix.h" 

#include  "QR.h" 

#include  "QRparallel.h" 
#include  "bcast.h" 


/*  TOPOLOGIX  EVENTS  FOR  MESSAGE  PASSING  */ 
/*  MESSAGE  PASSING  NETWORK  MESSAGES  */ 
I*  OUTPUT  VARIABLES  AND  FLAGS  */ 
/*  MATH  FORMULAS  ASSIGNED  VALUES  */ 
/*  TIME  OF  DAY  FOR  RANDOM  SEED  */ 
/*  COMPLEX  NUMBERS  AND  MATH  ROUTINES  */ 
/*  MATRICES  AND  VARIOUS  ROUTINES  */ 
/*  ROUTINES  FOR  THE  HOUSEHOLDER  ALGORITHM  */ 
I*  ROUTINES  &  STRUCTURES  FOR  TOPOLOGIX  */ 
/*  BROADCAST  AND  COLLECT  DATA  ON  TOPOLOGIX  */ 


/*  DEFINITIONS  THAT  ARE  NEEDED  FOR  THE  EMVDR  ALGORITHM  */ 
#define  freq  100. 

#define  wavelength  1500. /freq 

#define  spacing  wavelength  /  2. 

#define  degrad  *M_PI/180.  /*  DEGREE  TO  RADIAN  CONVERSION  */ 

#define  p  16  /*  NUMBER  OF  PROCS  PROGRAM  WILL  EXECUTE  ON  */ 

#define  pres  4  /*  NUMBER  OF  ROWS  FOR  PROCS:  SQRT  (P)  ABOVE  */ 

enum  boolean  {TRUE,  FALSE}; 

I*  ********************************************4^**************************** 


Calculate_H  (Umat,  col,  Hmat,  n) 
complex_matrix  Umat,  Hmat; 

int  col,  n; 

/*  . 

THIS  ROUTINE  WILL  GENERATE  THE  H  MATRIX  NEEDED  TO 
CALCULATE  THE  Q  MATRIX  TO  SOLVE  THE  QR  DECOMPOSITION. 

INPUT: 

Umat :  type  complex  matrix 
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Contains  the  columns  that  will  be  multiplied 
to  form  the  H  matrix. 

col  :  type  integer 

Column  of  interest  in  the  U  matrix. 

n  :  type  integer 

Number  of  columns/rows  in  the  H  matrix. 


OUTPUT: 


Hmat  :  type  complex  matrix 

Contains  the  "result  of  :  I-(U*Uh)/U(l,l). 

THIS  ROUTINE  USES  THE  FOLLOWING  EXTERNAL  ROUTINES: 

conj,  cmult,  and  cdiv. 


complex  tempc; 
int  i,j; 


/*  -  PERFORM  I  -  (U  *  Uh  /  Uii)  . 

if  (col  !=  n-1) 

{ 

for  (i  =  col;  i  <  n;  ++i) 

for  (j  =  col;  j  <  n;  ++j) 

{ 

tempc  =  conj  (Umat(j][col]); 

Hmat[i][j]  =  cmult  (Umat[i][col],  tempc); 

Hmat{i][j]  =  cdiv  (Hmat[i](j],  Umat[col][col]); 

/*  —  SUBTRACT  RESULT  FROM  THE  IDENTITY  MATRIX 
Hmat[i]|j].im  =  -Hmat[i][j].im; 

if  (i  =  j) 

Hmat[i][i].re  =  1.  -  Hmat[i][i].re; 
else 

Hmat[i](j]  Je  =  -Hmat[i][j].rc; 

} 

/*  —  FILL  IN  OTHER  PART  OF  H  MATRIX  TO  IDENTITY  MATRIX. 

for  (i  =  0;  i  <  col;  ++i) 
for  (j  =  0;  j  <  n;  ++j) 

{ 

Hmat[i][j].im  =  0.; 

Hmat[j][i].im  =  0.; 
if  (i  “  j) 

Hmat[i][i].re  =  1.; 
else 
{ 

Hmat[i][j].re  =  0.; 

Hmat[j][i].re  =  0.; 

} 

} 
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/*  **************************************************************************  */ 


Parallel_Matrix_Mult  (first_matrix,  second_matrix,  result,  rows,  proc_n  umbers) 
complcx_matrix  first_matrix,  second_matrix,  result; 

int  proc_numbers; 

/*  . 

THIS  ROUTINE  WILL  PERFORM  MATRIX-MATRIX  MULTIPLICATION 
IN  PARALLEL  BY  USING  MULTIPLE  PROCESSORS  ON  THE 
TOPOLOGIX  SYSTEM.  THE  TOTAL  MATRIX  IS  ASSUMED  TO  BE  IN 
EVERY  PROCESSOR.  EACH  PROCESSOR  WILL  WORK  ON  A  SQUARE 
SUBBLOCK  (size:  #  of  rows  in  matrix  /  #  of  processors  in  row)  OF  THE 
ORIGINAL  MATRIX. 


INPUT: 

first_matrix  : 

secondmatrix  : 

rows  : 

procnumbers  : 
OUTPUT: 

result : 


type  complex  matrix 

Upper  triangular  matrix  from  QR  decomposition, 
type  complex_matrix 

Normalized  matrix  from  QR  decomposition, 
type  integer 

Number  of  rows  that  are  in  the  Q  or  R  matrices, 
type  integer 

Number  of  rows  or  columns  of  processors  (sqrt(p)). 


type  complex_matrix 

Diagonal  contains  the  eigenvalues  of  the  matrix 
that  was  decomposed  by  the  QR  decomposition. 


THIS  ROUTINE  USES  THE  FOLLOWING  EXTERNAL  ROUTINES: 


Meshid,  cmult,  and  cadd. 


int  Ablockcol,  Ablockrow,  Bblockcol,  Bblockrow; 

int  icol,  irow,  il,  i2,  j  1 ,  j2,  count,  sp; 

complex  tempc; 


*/ 


Meshid  (&irow,  &icol); 

Ablockrow  =  irow; 

Ablockcol  =  irow; 

Bblockrow  =  irow; 

Bblockcol  =  icol; 

sp  =  rows  /  proc.numbers; 

for  (count  =  0;  count  <  proc_numbers;  ++count) 

{ 

for  (il  =  Ablockrow*sp;  il  <  (Ablockrow* sp)+sp;  -H-il) 
for  (j2  =  Bblockcol*sp;  j2  <  (Bblockcol*sp)+sp;  ++j2) 

for  (i2=Bblockrow*spjl=Ablockcol*sp;i2<(Bblockrow*sp)+sp;  -H-i2,+4jl) 

( 

tempc  =  cmult  (first_matrix(il][jl),  second_matrix[i2](j2]); 
iesult[il][j2]  =  cadd  (tempc,  result[il][j2J); 
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} 

-H-Ablockcol; 

if  (Ablockcol  >=  proc_numbers)  Ablockcol  =  0.; 

-H-Bblockrow; 

if  (Bblockrow  >=  proc_numbers)  Bblockrow  =  0.; 

} 

/*  **************************************************************************  */ 

QR_Iteration  (Cmat,  Tmat,  n) 
complex_matrix  Cmat,  Tmat; 

int  n; 

/*  . 

THIS  ROUTINE  WILL  PERFORM  A  QR  DECOMPOSITION  USING  p 
PROCESSORS  IN  A  PARALLEL  FASHION.  THE  MATRIX  IS 
BROADCAST  TO  ALL  PROCESSORS  WITH  EACH  WORKING  ON  A 
SMALL  PART  OF  THE  MATRIX.  WHEN  THE  ROUTINE  IS  COMPLETE 
ALL  THE  PROCESSORS  WILL  CONTAIN  THE  Q,  R,  T  (accumulation  of 
Qs),  and  Eigs  (eigenvalues  of  matrix).  THIS  ROUTINE  WAS  TAKEN  FROM 
THE  LINPACK  USER'S  GUIDE,  CHAPTER  9. 

INPUT: 

Cmat  :  type  complex_matrix  Current  covariance  matrix 

Tmat :  type  complex^matrix 

Identity  matrix  first  time  through  and  accumulation  of  Qs 
for  every  other  time. 

n  :  type  integer 

Number  of  rows  in  the  Cmat  matrix. 

OUTPUT: 

Cmat :  type  complex  matrix 

Matrix  with  diagonal  values  being  the  eigenvalues 
of  the  covariance  matrix. 

Tmat :  type  complex_matrix 

Matrix  containing  the  accumulation  with  the  new  Q 
matrix.  The  eigenvectors  of  the  covariance  matrix. 

THIS  ROUTINE  USES  THE  FOLLOWING  EXTERNAL  ROUTINES: 

Broadcast,  Calculate_H,  cassign,  cdiv,  cmat  equate,  cmatmult, 
cmult.  Exchange,  ParaIlel_Matrix_Mult,  receive,  send,  zaxpy,  zdot, 
znrm2,  and  zscal 


int  k,  1,  i,  j,  loopstart,  xcount; 

int  iam,  usize,  prow,  pcol,  ihave; 

float  temp,  Enrm; 

complex  sigma,  nrmc,  tempc,  cnst,  t; 

complexjmatrix  Rmat,  Umat,  Hmat,  HI  mat,  H2mat,  Tempmat; 

FILE  *data,  *fopen  0; 


A-6 


TR  8807 


usize  =  sizeof(Umat);  /*  GET  SIZE  FOR  MESSAGE  PASSING  */ 

/* - GET  THE  VIRTUAL  NODE  ID  FOR  THE  COMPUTATIONS -  */ 

iam  =  vl6_node_number[getncxieid0]; 

/*  -  TRANSFER  THE  COVARIANCE  MATRIX  FOR  ROUTINE  -  */ 

cmat_equate  (Rmat,  Cmat,  n); 

/* - SEND  OUT  THE  ENTIRE  MATRIX  TO  ALL  PROCESSORS  -  */ 

Broadcast  (Rmat,  usize,  0); 

/*  —  START  THE  MAIN  COMPUTATIONS  FOR  HOUSEHOLDERS  —  */ 
for  (1  =  0;  1  <  n-1;  ++1) 

k  =  1  %  p;  /♦  WHAT  PROCESSOR  HOLDS  THE  NEEDED  COLUMN  ?  */ 

/*  —  GET  HYPERCUBE  NODE  ID  FOR  BROADCAST  ROUTINE  —  */ 
ihave  =  vl6_to_hypercube_number[kJ; 

if  (iam  =  k) 

{  /*  WHAT  PROCESSOR  NEEDS  TO  CALCULATE  U  COLUMN  ?  */ 

Enrm  =  znrm2  (n-1,  Rmat,  1,  n);  /*  NORM  OF  COLUMN  L  */ 

if  (Enrm  !=  0) 

{ 

if  (Rmat[l][l].re  !=  0.) 

/*  -  GET  SIGN  FOR  U  VECTOR  CALCULATION  -  */ 

temp  =  cabs  (Rmat[l][l]); 
sigma.re  =  Rmat[l][l}.Te/temp; 
sigma,  im  =  Rmat[l][l].im/temp; 

} 

else 

sigma  =  cassign  (1.,  0.); 

nrmc  -  cassign  (sigma.re*Enrm,  sigma.im*Enrm); 
tempc  =  cassign  (1„  0.); 
cnst  =  cdiv  (tempc,  nrmc); 

/*  -  CONSTRUCT  U  VECTOR  -  */ 

zscal  (n-1,  cnst,  Rmat,  1,  n,  Umat); 

Umat[l][l].re  =  Umat[l][l].rc  +  1.; 

Rtnat[l][l]  =  cassign  (-nimc.re,  -nrmc.im); 
for  (i  =  1+1;  i  <  n;  ++i) 

Rmat[i][l]  =  cassign  (0.,  0.); 
for  (i  =  0;  i  <  1;  ++i) 

Umat[i][l]  =  cassign  (0.,  0.); 

} 

} 

/*  -  SEND  OUT  U  MATRIX  TO  ALL  PROCESSORS  -  */ 

Broadcast  (Umat,  usize,  ihave); 


if  (iam  ==  k) 

{  /*  PROCESSOR  THAT  CALCULATED  U  VECTOR  */ 

if  (Enrm  !=  0) 

{  /*  FINISH  CALCULATING  OTHER  COLS  OF  R  MATRIX  */ 

for  (j  =  1+p;  j  <  n;  j  +=  p) 
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} 

else 

{ 


}  /* 


}/* 


{ 

t  =  zdot  (n-1,  Umat,  1,  Rmat,  j,  n); 

t  =  cdiv  (t,  Umat[l][l]); 

tempc.re  =  -1; 

tempc.im  =  0.; 

t  =  cmult  (tempc,  t); 

zaxpy  (n-1,  t,  Umat,  1,  Rmat,  j,  n); 

} 


/*  PROCESSOR  IS  NOT  EQUAL  TO  K  */ 


loopstart  =  iam; 
xcount  =  1; 
while  (loopstart  <  1) 

{ 

loopstart  =  (xcount  *  p)  +  iam; 

++xcount; 

} 

for  (j  =  loopstart;  j  <  n;  j  +=  p) 

{ 

t  =  zdot  (n-1,  Umat,  1,  Rmat,  j,  n); 
t  =  cdiv  (t,  Umat[l][l]); 

tempc.re  =  -1.;  /*  CREATE  NEGATIVE  #  */ 


tempc.im  =  0.; 

t  =  cmult  (tempc,  t);  /*  NEGATE  t  */ 

zaxpy  (n-1,  t,  Umat,  1,  Rmat,  j,  n); 

- -  END  OF  THE  IF  ELSE  LOOP  -  */ 

-  end  OF  THE  1  LOOP  -  */ 


/*  -  IF  PROCESSOR  ZERO,  EQUATE  MATRIX  Tempmat  TO  IDENTITY  -  */ 
if  (iam  =  0) 

{ 

for  (i  =  0;  i  <  n;  ++i) 
for  (j  =  0;  j  <  n;  ++j) 

{ 

Tempmat[i][j].im  =  0.; 

if(i=j) 

Tempmat[i][i].re  =  1.; 
else 

Tempmat[i][j].re  =  0.; 

1 

} 

/*  —  COLLECT  H  MATRICES  &  GENERATE  Q  MATRIX  IN  NODE  0  —  */ 
for  (1  =  n-p+iam;  1  >=  iam;  1  -=  p) 

/*  -  CALCULATE  YOUR  H  MATRIX  .  */ 

Calculate_H  (Umat,  1,  Hmat,  n); 
if  (receive  (R,  usize,  Hlmat,  iam,  0)  !=  -1) 

/*  -  CREATE  TEMPORARY  MATRIX  MULTIPLICATION  HOLDER  - 
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for  (i  =  0;  i  <  n;  ++i) 
for  (j  =  0;  j  <  n;  ++j) 

H2mat[i][j].im  =  0.; 

if  (i  =  j) 

H2mat[i][i].re  =  1.; 
else 

H2mat[i][j].re  =  0.; 

/*  -  MULTIPLY  THE  H's  TOGETHER  .  */ 

cmat_mult  0,  n,  Hmat,  HI  mat,  H2mat); 

/*  —  SWAP  THE  MATRICES  FOR  THE  NEXT  MULTIPLY  —  */ 

cmat_equate  (Hmat,  H2mat,  n); 

) 

/* - SEND  OUT  THE  MULTIPLICATION  RESULT  TO  LEFT  - —  */ 

send  (L,  usize,  Hmat,  iam,  0); 

if  ((iam  %  4)  —  0) 

{  /*  IF  YOU'RE  IN  COLUMN  ZERO  THEN  PASS  DATA  UP  */ 

if  (receive  (D,  usize,  Hlmat,  iam,  0)  !=  -1) 

{ 

cmat_mult  (1,  n,  Hmat,  Hlmat,  H2mat); 
cmat_equate  (Hmat,  H2mat,  n); 

send  (U,  usize,  Hmat,  iam,  0); 
if  (iam  ==  0) 

{ 

cmat_mult  (1,  n,  Hmat,  Tempmat,  H2mat); 
cmat_equate  (Tempmat,  H2mat,  n); 

}  1 

} 

/*  -  CLEAR  THE  UNIMPORTANT  COLUMNS  -  */ 

for  0  =  0;  1  <  n;  ++1) 

{ 

if  (1<P) 

{ 

if  (1  !=  iam) 

{ 

for  (i  =  0;  i  <  n;  ++i) 

Rmat[ij[l]  =  cassign  (0.,0.); 

)  1 

else 

{ 

if  (0  %  (l/p)*p)  !=  iam) 

for  (i  =  0;  i  <  n;  ++i) 

Rmat[ij[l]  =  cassign  (0.,0.); 

) 

} 

} 
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/*  - GET  the  entire  r  matrix  into  every  PROCESSOR  -  *1 

Exchange  (Rmat,  usize,  n); 

/*  -  BROADCAST  THE  ENTIRE  Q  MATRIX  TO  EVERY  PROCESSOR  -  */ 
Broadcast  (Hmat,  usize,  0); 

/*  .  CLEAR  THE  EIGENVALUE  MATRIX  .  */ 

for  (i  =  0;  i  <  n;  ++i) 
for  (j  =  0;  j  <  n;  ++j) 

Cmat[i][j3  =  cassign  (0.,  0.); 

I*  .  PERFORM  THE  RQ  MULTIPLICATION  .  */ 

ParalleLMatrix_Mult  (Rmat,  Hmat,  Cmat,  n,  pres); 

/*  -  PUT  EIGENVALUES  INTO  EVERY  PROCESSOR  .  */ 

Exchange  (Cmat,  usize,  n); 

/*  .  CLEAR  TEMPORARY  MATRIX  .  */ 

for  (i  =  0;  i  <  n;  ++i) 

for  (j  =  0;  j  <  n;  ++j) 

Tempmatni’1  =  cassign  (0.,  0.); 

/*  ACCUMULATE  TRANSFORMATION  MATRICES  (EIGENVECTORS)  */ 
ParaUel_Matrix_Mult  (Tmat,  Hmat,  Tempmat,  n,  pres); 

/*  -  PUT  EIGENVECTORS  INTO  EVERY  PROCESSOR  -  */ 

Exchange  (Tempmat,  usize,  n); 
cmat_.equate  (Tmat,  Tempmat,  n); 


/*  **************************************************************************  *j 

Update_data  (eigvals,  trans,  inputs,  alpha,  size) 
complex_matrix  eigvals,  trans; 

complex_vector  inputs;  /*  Input  vector  */ 

float  alpha;  /*  Forgetting  factor  for  the  covariance  update  */ 

int  size;  /*  Number  of  rows  or  columns  in  trans  *1 

/*  . 

THIS  ROUTINE  WILL  UPDATE  A  COVARIANCE  MATRIX  AND  THE 
COMPLEX  INPUTS  THAT  HAVE  PREVIOUSLY  BEEN  GENERATED. 

INPUT: 

eigvals  :  type  complex_matrix 

Originally  the~ covariance  matrix  that  is  complex 
Hermitian.  Later  it  is  the  current  eigenvalue  estimation 
matrix  (eigenvalues  on  diagonal). 

trans  :  type  complex_matrix 

Eigenvectors  of  the  covariance  matrix  to  be  used  to  update 
the  new  inputs. 

inputs  :  type  complex_vector 

New  inputs  that  the  hydrophones  have  received. 
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alpha  :  type  float 

Forgetting  factor  for  the  covariance  update  equation. 

size  :  type  integer 

Number  of  rows  in  the  input  vector. 

OUTPUT: 

eigvals  :  type  complex_matrix 

New  updated 'covariance  matrix  eigenvalue  estimation, 
includes  new  data  information. 

inputs  :  type  complex_vector 

New  data  that- has  been  updated  to  conform  to  the  last 
eigenvectors  calculated. 

THIS  ROUTINE  USES  THE  FOLLOWING  EXTERNAL  ROUTINES: 
cmat_add,  cmat  trans,  cmatvect_mult,  cms_mult,  and  cvect_equate. 


int  i,  j; 

complex  thold; 

complex  jmatrix  transh,  result,  ans  1 ,  ans2; 

complex_vector  updata; 

/*  -  GET  HERMITIAN  TRANSPOSE  OF  TRANSFORMATION  MATRIX  -  */ 
cmat_trans  (trans,  transh,  size); 

/*  .  UPDATE  INPUT  VECTOR  - -  */ 

cmatvect_mult  (transh,  inputs,  updata,  size); 
cvect_equate  (updata,  inputs,  size); 

/*  -  UPDATE  COVARIANCE  MATRIX  USING  ABOVE  RESULT  -  */ 

for  (i  =  0;  i  <  size;  ++i) 

for  (j  =  0;  j  <  size;  ++j) 

thold  =  conj  (updataU']); 
result[ij[j]  =  cmult  (updata[i],  thold); 

cms_mult  (result,  alpha,  ansi,  size); 
cms_mult  (eigvals,  1. -alpha,  ans2,  size); 
cmat_add  (ansi,  ans2,  eigvals,  size); 


/ *  **************************************************************************  */ 

Normal  (number  1,  mean,  sd,  outreals) 
float  mean,  sd; 

float  outrealsQ; 

int  number  1; 

I*  . 

THIS  ROUTINE  WILL  GENERATE  AN  ARRAY  OF  RANDOM  NUMBERS 
WITH  A  GAUSSIAN  DISTRIBUTION. 

INPUT 
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numberl  :  type  integer 

Number  of  values  to  be  in  the  resulting  array. 

mean  :  type  float 

Average  of  the  values  to  be  generated, 
sd  :  type  float 

Standard  deviation  of  the  data  to  be  generated. 

OUTPUT 

outreals  :  type  float  (vector) 

Will  contain  values  with  Gaussian  distribution. 

.  */ 

{ 

#defme  to31  2147483648.0 
#define  ix  71365 

#define  FRAND  (float)((float)randoin()/(float)MAXLONG) 
long  int  iu; 

double  sqrt  (),  log  (),  erand48  (); 

float  root,  unifl,  unif2,  temp,  gaussl,  gauss2; 

int  i; 

iu  =  FRAND; 

for  (i  =  0;  i  <  numberl;  i+=2) 

{ 

do 

( 

iu  =  ix  *  iu  +  1,  /*  FILL  UP  ARRAY  WITH  RANDOM  NOISE  */ 

unifl  =  iu/to31; 

unif2  =  2.  *  FRAND  -  1.; 

temp  =  unifl  *  unifl  +  unif2  *  unif2; 

}  while  (temp  >  1.); 

root  =  sqrt  (-2.*log  (temp)  /  temp); 

gaussl  =  unifl  *  root; 

gauss2  =  unif2  *  root; 

outreals[i]  =  mean  +  sd  *  gaussl; 

outreals[i+l]  =  mean  +  sd  *  gauss2; 

} 

} 

/*  **************************************************************************  *f 

SteerGen  (arraysize,  MRA,  steer) 
int  arraysize; 

float  MRA; 

complex_vector  steer, 

/♦  . 

THIS  ROUTINE  GENERATES  THE  STEERING  VECTOR  FOR  THE  LOOK 
DIRECTION  SPECIFIED. 

INPUT: 

arraysize  :  type  integer 

Number  of  hydrophones  in  the  array. 
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MRA  :  type  float 

Bearing  of  the  look  direction  (degrees). 


OUTPUT: 

steer  :  type  complex  vector 

Will  contain  the  vector  for  the  look  direction. 


THIS  ROUTINE  USES  THE  FOLLOWING  EXTERNAL  ROUTINES: 


cassign. 


{ 

int  i; 

float  ang,  angle; 

double  cos(),  sin(); 

ang  =  sin  (MRA*M_PI/180.); 
for  (i  =  O,  i  <  arraysize;  ++i) 

{ 

angle  =  i  *  ang  *  M_PI; 

steer[i]  =  cassign  (cos(angle),  -sin(angle)); 

}  } 


*/ 


/*  **************************************************************************  * j 


Processlnputs  (number_of_sensors,  how_many_sources,  signal_db_values,  signal_bearings, 
time_interval,  sensor_values) 

int  time_interval,  how_many_sources,  number_of_sensors; 

tloat  signal_db_valuesG,  signal_bearings[]; 

complex_vector  sensor_values; 

/*  . 

THIS  ROUTINE  WILL  PROVIDE  SENSOR  OUTPUTS  BY  USING  THE 
SENSOR  INPUTS  THAT  HAVE  BEEN  PREVIOUSLY  GENERATED. 


INPUT: 

number_of_sensors 


how_many_sources 


signal_db_values 


type  integer 

Contains  number  of  hydrophones  per  beam, 
type  integer 

Specifies  number  of  sources  present  to  hydrophones, 
type  float  (array) 

Contains  decibel  values  assigned  to  each  source. 


signal_bea rings  :  type  float  (array) 

Contains  the  bearing  for  each  signal  present. 


timejnterval  :  type  integer 

Contains  time  interval  in  question.  Used  for  sigvar 
and  noise  indices. 


OUTPUT: 
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sensor_values  :  type  complex_vector 

Will  contain  hydrophone  outputs  that  were  generated. 

THIS  ROUTINE  USES  THE  FOLLOWING  EXTERNAL  ROUTINES: 


cadd,  cassign,  cmult,  Normal,  and  SteerGen. 


{ 


int 

complex 

float 

complex_vector 

float 

double 

static  enum  boolean 


i,  j,  k; 

signal[20][400],  noise[400],  temp; 
nreal[400],  nimag[400],  sreal[400],  simag[400]; 
steervect,  temp2; 
sigma; 

pow  (),  sqrt  0; 
first  =  TRUE; 


*/ 


for  (i  =  0;  i  <  how_many_sources  &&  first  ==  TRUE;  ++i) 

{ 

/*  GENERATE  SIGNAL  POWERS  ASSUMING  NOISE  VARIANCE  =  1  */ 

sigma  =  pow  (10.,  signal_db_values[i]/20.)  /  sqrt(2.); 

Normal  (400, 0.,  sigma,  sreal); 

Normal  (400, 0.,  sigma,  simag); 
for  (k  =  0;  k  <  400;  ++k) 

signal[i][k]  =  cassign  (sreal[k],  simag[k]); 

} 

first  =  FALSE; 

/*  .  GENERATE  THE  NOISE  FOR  THE  HYDROPHONES  -  */ 

Normal  (number_of_sensors,  0.,  l/sqrt(2.),  nreal); 

Normal  (number_of_sensors,  0.,  l/sqrt(2.),  nimag); 
for  (k  =  0;  k  <  number_of_sensors;  ++k) 

noise[k]  =  cassign  (nreal[k],  nimagfk]); 

/*  -  USE  VALUES  TO  GENERATE  INPUT  VECTOR  -  *1 

for  (k  =  0;  k  <  number_of_sensors;  ++k) 
temp2[k]  =  cassign  (0.,  0.); 
for  (j  =  0;  j  <  how_many_sources;  ++j) 

{ 

/*  -  GENERATE  STEERING  VECTOR  FOR  EACH  SOURCE  BEARING  -  */ 
SteerGen  (number_of_sensors,  signal_bearings[j],  steervect); 
for  (k  =  0;  k  <  number_of_sensors;  ++k) 

{ 

temp  =  cmult  (s<  gnal[k][time_interval],  steervect[k]); 
temp2[k]  =  cadd  (temp2[k],  temp); 

} 

) 

/♦  -  COMPLETE  THE  GENERATION  OF  INPUT  VECTOR  -  */ 

for  (i  =  0;  i  <  number_of_sensors;  ++i) 

sensor_values[i]  =  cadd  (noisefi],  temp2[ij); 


j*  i**************************************************************************  */ 


Inputvalues  (file,  howmany,  SINR,  bearings) 
float  SINR[],  bearings[]; 

FILE  *file; 
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int  howmany; 

/*  . 

THIS  ROUTINE  WILL  READ  IN  DECIBEL  VALUES  AND  BEARINGS  OF 
ANY  SIGNALS  THAT  ARE  PRESENT  IN  THE  SYSTEM. 

THE  FILE  SHOULD  HAVE  THE  FOLLOWING  FORMAT: 

dB_value  space  angle_value 

WHERE  THE  angle_value  IS  BETWEEN  -90  DEGREES  AND  90  DEGREES. 
INPUT: 

file  :  type  FILE 

Pointer  to  the  input  file.  Must  be  opened  before  this 
routine  is  called. 

howmany  :  type  integer 

Number  of  signals  that  are  present  to  the  system. 

OUTPUT: 

SINR  :  type  float  (vector) 

Decibel  values  for  all  signals  present. 

bearings  :  type  float  (vector) 

Angle  locations  for  all  the  present  signals. 


{ 

int  i; 

for  (i  =  0;  i  <  howmany;  ++i) 

fscanf  (file,  "%f  %f &SINR[i],  &bearings[i]); 

} 

j*  **************************************************************************  */ 

EMV  (dominant_values,  efactor,  arraysize,  eigvects,  eigvals,  steer,  weights) 

int  dominant_values,  arraysize; 

float  efactor, 

complex_vector  steer,  weights; 

complexjmatrix  eigvects,  eigvals; 

I*  . 

THIS  ROUTINE  WILL  GENERATE  THE  WEIGHTS  TO  MINIMIZE  THE 
OUTPUT  RESPONSE  OF  A  BEAMFORMER  BY  USING  THE 
EIGENVECTORS  AND  EIGENVALUES  OF  THE  COVARIANCE  MATRIX. 
THE  FORMULA  IS  TAKEN  FROM  NUSC  TR  8305,  N.  OWSLEY, 
NOVEMBER  1988. 

INPUT: 

dominant_values  :  type  integer 

Number  of  dominant  signals  present. 

efactor  :  type  float 

Enhancement  factor  for  EMV  usually  >=  1. 
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arraysize  :  type  integer 

Number  of  hydrophones  in  the  array. 

eigvects  :  type  complex  matrix 

Eigenvectors  of  the  covariance  matrix. 

eigvals  :  type  complex  matrix 

Eigenvalues  of  the  covariance  matrix. 


steer  :  type  complex  vector 

Steering  vector  for  look  direction. 


OUTPUT: 

weights  :  type  complex  vector 

Will  contain  weights  to  minimize  beam  response. 

THIS  ROUTINE  USES  THE  FOLLOWING  EXTERNAL  ROUTINES: 

cabs,  cadd,  cassign,  cmatadd,  cmatvect_mult,  cms_mult, 
cmult,  and  "conj. 


int  i,  j,  k; 

float  temp,  part5,  denom,  scalar, 

complex  tc,  part3,  part4; 

complexjmatrix  part,  part2,  old,  numer, 

ctnat_init  (old,  arraysize); 
denom  =  0.; 

for  (i  =  0;  i  <  dominant_values;  ++i) 

{ 

/*  -  get  SCALAR  MULTIPLIER  eu/(l+eu) 

temp  =  efactor  *  (eigvals[i][i].re  - 1.); 
scalar  =  temp  /  (1.  +  temp); 

/*  .  SOLVE  NUMERATOR  PARTIAL  — 

for  (j  =  0;  j  <  arraysize;  ++j) 

for  (k  =  0;  k  <  arraysize;  ++k) 

{ 

tc  =  conj  (eigvects[k][i]); 
part[j][k]  =  cmult  (eigvects[j][i],  tc); 

} 

cms_mult  (part,  -scalar,  part2,  arraysize); 
cmat_add  (part2,  old,  old,  arraysize); 

/*  -  SOLVE  DENOMINATOR  PARTIAL  - 

part4  =  cassign  (0.,  0.); 
for  (j  =  0;  j  <  arraysize;  ++j) 

{ 

tc  =  conj  (eigvects(j][i]); 
part3  =  cmult  (tc,  steer[j]); 
part4  =  cadd  (part4,  part3); 

} 

part5  =  cabs  (part4)  *  cabs  (part4); 
part5  *=  scalar, 


*1 


* / 


*1 


*/ 
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denom  +=  part5; 
denom  =  arraysize  -  denom; 

/*  -  COMPLETE  NUMERATOR  COMPUTATIONS  -  */ 

for  (i  =  0;  i  <  airaysize;  ++i) 

for  (j  =  0;  j  <  arraysize;  ++j) 

if  (i=j) 

numer[i][i]  =  cassign  (l.+old[i][i].re,  old[i][i].re); 
else 

numer[i][j3  =  cassign  (old[i][j].re,  old[i][j].im); 


cmatvect_mult  (numer,  steer,  weights,  arraysize); 

/*  .  COMPLETE  WEIGHT  COMPUTATIONS  .  */ 

for  (i  =  0;  i  <  arraysize;  ++i) 

weights[i].re  /=  denom; 
weights[i].im  /=  denom; 

} 

} 

/**  FOLLOWING  ROUTINES  FOR  OUTPUTTING  GRAPH  TO  HP  SCREEN  **/ 
Beamfonm  (weights,  airaysize,  pattern) 


int  arraysize; 

complex_vector  weights; 

float  pattemQ; 

{ 

double  cos  (),  sin  (),  sqrt  (); 

int  i,  angle; 

float  theta,  number,  angle2; 

complex  sum; 

/*  . .  ANGLE  SPANS  -pi/2  TO  pi/2  .  */ 


for  (angle  =  -180;  angle  <=  180,  ++angle) 

sum  =  cassign  (0.,0.); 
for  (i  =  0;  i  <  arraysize;  ++i) 

theta  =  2.  *  M_PI  *  i  *  spacing  *  sin  ((angle/2.)  degrad); 
theta /=  wavelength; 

/*  FOR  ALL  HYDROPHONES:  WEIGHTS  ADDED  TOGETHER  */ 

sum.re  +=  weights[i].re*cos(theta)  -  weights[i].im*sin(theta); 
sunum  +=  weights[i].re*sin(theta)  +  weights[i].im*cos(theta); 

/*  —  MAGNITUDE  OF  SUM  (OF  WEIGHTS)  IS  CALCULATED  —  */ 
pattem[angle+180]  =  sqrt  ((sum.re  *  sum.re)  +  (sum.im  *  sum.im»; 

} 
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Normalize  (weights,  array_size,  pattern,  normalized_pattem) 


tnt 

complex_vector 

float 

float 

{ 


anay_size; 

weights; 

pattemQ; 

normalized_pattemO; 


double 

int 

float 


sqrt(),  log  10(); 
i,  angle; 

constant  =  0.,  mag; 


for  (i  =  0;  i  <  array_size;  ++i) 

{ 

mag  =  cabs  (weights[i]); 
constant  +=  mag; 

for  (angle  =  0;  angle  <=  360;  ++angle) 

normalized_pattem[angle]  =  20.  *  loglO  (pattern [angle]/constant); 


Output  (file,  normalized_pattem) 
char  *file; 

float  normalized_pattem[] ; 


int  angle; 

static  int  firsttime  =  0; 

float  angle2; 

FILE  *datafile,  *fopen  (); 


datafile  =  fopen  (file,  "w"); 

for  (angle  =  -180;  angle  <=  180;  -H-angle) 

if  (normalized_pattem[angle+180]  <  -50.)  /*  CHECK  IF  OFF  GRAPH  */ 

normalized_pattern[angle+180]  =  -50.; 

/*  OUTPUT  BOTH  ANGLE  &  DECIBEL  VALUE  AT  THAT  ANGLE  */ 
angle2  =  angle  /  2.; 

fprintf  (datafile,  "%f  %f\n",  angle2,  normalized_pattem[angle+180]); 
fclose  (datafile); 


/*  ***************************  MAIN  ROUTINE  ****************************  */ 


main  0 

{ 

FILE 

int 

float 

char 

vector 

complex_matrix 

complex_veetor 


♦fopen  0,  *data; 

i,  j,  arraysize,  sources,  dominant,  iam; 
pattem[400],  npattem[400); 

♦filename; 
sigvals,  bears; 
eigvects,  eigvalues; 
inputvector,  steering,  weights; 
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kinit(100);  /*  ATTACH  TO  THE  TOPOLOGIX  KERNEL  */ 

/*  -  GET  VIRTUAL  NODE  ID  FOR  COMPUTATIONS  - .  */ 

iam  =  vl6_node_number[getnodeid()]; 

/*  -  DEFINE  NUMBER  OF  HYDROPHONES  IN  ARRAY  .  */ 

arraysize  =  32; 
if  (iam  —  0) 

{ 

printf  ("\n\tHow  many  sources  are  presentTNn"); 

scanf  ("%  d",  &sources); 

data  =  fopen  ("DATAXX",  Mr"); 

for  (i  =  0;  i  <  arraysize;  ++i) 

for  (j  =  0;  j  <  arraysize;  ++j) 

fscanf  (data,"%f  %f',&eigvalues[i][j].re,&eigvalues[i][j].im); 
fclose  (data); 

) 

/*  —  INITIALIZE  EIGENVECTORS  TO  I  AND  WEIGHTS  TO  ONE  - —  */ 

for  (i  =  0;  i  <  arraysize;  ++i) 

{ 

weights[i]  =  cassign  (1.,  0.); 
for  (j  =  0;  j  <  arraysize;  ++j) 

if  (i=j) 

eigvects[i][i]  =  cassign  (1.,  0.); 

else 

eigvects[i][j]  =  cassign  (0.,  0.); 

} 

/*  -  read  in  values  needed  for  program  to  run  -  */ 

data  =  fopen  ("values",  "r"); 

Inputvalues  (data,  sigvals,  bears,  sources); 

fclose  (data); 

for  (i  =  0;  i  <  10;  ++i) 

/*  -  GENERATE  NEW  DATA  VECTOR  . .  */ 

Processlnputs  (arraysize,  sources,  sigvals,  bears,  i,  inputvector); 

/*  -  UPDATE  THE  EIGENVECTORS  WITH  NEW  DATA  -  */ 

Update_data  (eigvalues,  eigvects,  inputvector,  .8188,  arraysize); 

/*  -  PERFORM  QR  DECOMPOSITION  AND  RQ  MULTIPLICATION  - 

*/ 

QRJteration  (eigvalues,  eigvects,  arraysize); 

/*  -  GENERATE  THE  STEERING  VECTOR  -  */ 

SteeiGen  (arraysize,  0.,  steering); 

/*  ~  PERFORMEMVDR  ALGORITHM  WITH  CALCULATED  VALUES  —  */ 
dominant  =  5; 

EMV  (dominant,  1„  arraysize,  eigvects,  eigvalues,  steering,  weights); 

kexit(10);  /*  REMOVE  FROM  TOPOLOGIX  KERNEL  */ 

)  /*  -  END  OF  THE  MAIN  PROGRAM  -  */ 
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APPENDIX  B 

SOURCE  CODE  FOR  BROADCAST  AND  EXCHANGE  ROUTINES 


B-l/B-2 
Reverse  Blank 


I*  ********* . ********************** . ****** . ****** 

THE  TWO  ROUTINES  IN  THIS  FILE  ARE  THE  COMMUNICATION 
ROUTINES  THAT  ARE  NEEDED  BY  PROGRAMS  THAT  PASS  DATA 
ON  THE  TOPOLOGIX  SYSTEM. 

***************************** . ****************************  */ 

static  int  Which_Port[4]  =  {  DL_CB0,  DL_CB1,  DL_CB2,  DL_CB3}; 

Broadcast  (data,  size,  whohasthedata) 
int  "‘data,  size,  whohasthedata; 

/*  . - . 

THIS  ROUTINE  WILL  BROADCAST  DATA  THROUGH  THE 
HYPERCUBE.  THE  DATA  IS  ORIGINALLY  HELD  BY  ONE  PROCESSOR. 
THIS  ROUTINE  CAN  PERFORM  THE  BROADCAST  IN  0(d) 
OPERATIONS. 

INPUT: 

data  :  type  pointer 

Points  to  data  that  will  be  broadcast. 


size  :  type  integer 

Size  of  the  data  block. 

whohasthedata  :  type  integer 

Hypercube  processor  id 


int  bit,  whoiam=getnodeid(),  mask=OxFFFF,  orbit=Oxl,  whofrom; 
int  iamwho  =  whoiam; 

/*  .  SET  MESSAGE  FLAGS  NEEDED  . 

mesg.nh_type  =  0; 
mesg.nh_length  =  size; 
mesg.nh_flags  =  0; 
mcsg.nh_node  =  0; 
mesg.nh_msg  =  data; 

for  (bit  =  0;  bit  <  4;  ++bit)  /*  COVER  ALL  BITS  IN  ID  */ 

{ 

whoiam  &=  mask/*  MASK  OFF  UNNEEDED  BITS  FROM  SOURCE  ID  */ 
whohasthedata  &=  mask/*  MASK  OFF  UNNEEDED  BITS  */ 
mask  «=  1  /*  SHIFT  MASK  BIT  OVER  BY  ONE  TO  LEFT  */ 
whofrom  =  whoiam  A  orbit; 

orbit  «=  !/*  SHIFT  ORBIT  BIT  OVER  BY  ONE  TO  LEFT  */ 

mesg.nh_dl_event  =  Which_Port[bit]; 
mesg.nh_event  =  abs  (Which_Port[bitJ); 

if  (whoiam  —  whohasthedata) 
dsend  (&mesg); 

else  if  (whofrom  —  whohasthedata) 
drecv  (&mesg); 


)  /*  -  end  of  the  for  loop 

)  /*  -  end  OF  THE  PROGRAM  - 


*/ 


*/ 


*/ 

*/ 
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Exchange  (Xmatrix,  size,  matsize) 
int  size,  matsize; 

complex_matrix  Xmatrix; 

/*  . 

THIS  ROUTINE  EXCHANGES  DATA  HELD  IN  ALL  PROCESSORS 
WITH  ALL  THE  PROCESSORS. 


INPUT: 

Xmatrix  : 

size  : 


matsize  : 


type  complex_matrix 

All  vectors  but  the  ones  to  pass  should  be  set  to  zero, 
type  integer 

Specifies  size  of  Xmatrix  in  bytes, 
type  integer 

Specifies  #  of  rows  in  Xmatrix. 


OUTPUT: 

Xmatrix  :  type  complex_matrix 

Contains  all  data  from  all  procs. 


int  bit,  iam  =  getnodeidQ,  orbit  =  1 ,  i,  j; 

complex_matrix  X2matrix; 

static  struct  nmsg  mesg2; 


*/ 


} 


mesg.nh_type  =  mesg2.nh_type  =  0; 
mesg.nh_length  =  mesg2.nh_length  =  size; 
mesg.nh_flags  =  mesg2.nh_flags  =  0; 
mesg.nh_node  =  mesg2.nh_node  =  0; 
for  (bit  =  0;  bit  <  4;  ++bit) 

{ 

mesg.nh_event  =  abs  (Which_Port[bit]>; 
mesg2.nh_evsnt  =  abs  (Which_Port[bit]); 
if  ((iam  &  orbit)  !=  0)  /*  PLANE  ONE  */ 

{ 

mesg.nhjmsg  =  Xmatrix; 
mesg.nh_dl_event  =  Which_Port[bit]; 
dsend  (&mesg); 
mesg2.nh_msg  =  X2matrix; 
drecv  (&mesg2); 

} 

else  /*  .  PLANE  TWO 


{ 

mesg2.nh_msg  =  X2matrix; 
drecv  (&mesg2); 
mesg.nh_msg  =  Xmatrix; 
mesg.nh_dl_event  =  Which_Port[bit]; 
dsend  (&mesg); 

} 

cmat_add  (Xmatrix,  X2matrix,  matsize);  f*  PERFORM  DATA  EXCHANGE  */ 
orbit  «=  1; 


*/ 
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